import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score
df = pd.read_csv(r'data/used_car_dataset.csv')
print(df.shape)
(9997, 14)
df.head()
| | price | year | manufacturer | condition | cylinders | fuel | odometer | transmission | type | paint_color | F1 | F2 | F3 | F4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21978.0 | 2016.0 | ford | like new | 6 cylinders | gas | 80813.0 | automatic | SUV | black | 138 | 2.472286 | 0.059031 | b |
| 1 | 4185.0 | 2008.0 | ford | good | 8 cylinders | gas | 201800.0 | automatic | SUV | white | 415 | 2.285245 | 0.046328 | c |
| 2 | 7693.0 | 2002.0 | ford | excellent | 6 cylinders | gas | 145000.0 | automatic | pickup | white | 535 | 1.861461 | 0.158554 | b |
| 3 | 15778.0 | 2016.0 | ford | excellent | 4 cylinders | gas | 50103.0 | automatic | sedan | white | 3435 | 2.331671 | 0.213665 | c |
| 4 | 10520.0 | 2005.0 | ford | good | 8 cylinders | gas | 207663.0 | automatic | truck | white | 1212 | 2.061082 | 0.388724 | b |
## Counts for cat vars
# df['manufacturer'].value_counts()
# df['condition'].value_counts()
# df['cylinders'].value_counts()
# df['fuel'].value_counts()
# df['transmission'].value_counts()
# df['type'].value_counts()
# df['paint_color'].value_counts()
# df['F4'].value_counts()
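The commented-out `value_counts()` calls above can be collapsed into one loop over the object-typed columns. A minimal sketch; the `summarize_categoricals` helper and the tiny `demo` frame are illustrative stand-ins, and on the real data the loop would run over `df`:

```python
import pandas as pd

def summarize_categoricals(frame, top_n=5):
    """Return a dict mapping each object column to its top value counts."""
    summaries = {}
    for col in frame.select_dtypes(include=['object']).columns:
        summaries[col] = frame[col].value_counts().head(top_n)
    return summaries

# Tiny demo frame standing in for the real dataset
demo = pd.DataFrame({'fuel': ['gas', 'gas', 'diesel'], 'price': [1, 2, 3]})
for col, counts in summarize_categoricals(demo).items():
    print(col)
    print(counts)
```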
df.describe()
| | price | year | odometer | F1 | F2 | F3 |
|---|---|---|---|---|---|---|
| count | 9997.000000 | 9997.000000 | 9.305000e+03 | 9997.000000 | 9997.000000 | 9997.000000 |
| mean | 14233.961989 | 2009.817345 | 1.139147e+05 | 1534.257377 | 2.224515 | 0.100247 |
| std | 12595.955311 | 8.847370 | 1.306907e+05 | 2138.048191 | 0.226997 | 0.100804 |
| min | 0.000000 | 1922.000000 | 0.000000e+00 | 4.000000 | 0.167352 | -0.313948 |
| 25% | 5594.000000 | 2007.000000 | 6.702100e+04 | 308.000000 | 2.131407 | 0.030308 |
| 50% | 10333.000000 | 2012.000000 | 1.110000e+05 | 698.000000 | 2.255297 | 0.101295 |
| 75% | 19389.000000 | 2015.000000 | 1.512140e+05 | 2013.000000 | 2.361656 | 0.167654 |
| max | 402498.000000 | 2021.000000 | 9.999999e+06 | 36773.000000 | 2.728576 | 0.526848 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9997 entries, 0 to 9996
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   price         9997 non-null   float64
 1   year          9997 non-null   float64
 2   manufacturer  9997 non-null   object
 3   condition     9997 non-null   object
 4   cylinders     9997 non-null   object
 5   fuel          9997 non-null   object
 6   odometer      9305 non-null   float64
 7   transmission  9997 non-null   object
 8   type          9997 non-null   object
 9   paint_color   9997 non-null   object
 10  F1            9997 non-null   int64
 11  F2            9997 non-null   float64
 12  F3            9997 non-null   float64
 13  F4            9997 non-null   object
dtypes: float64(5), int64(1), object(8)
memory usage: 1.1+ MB
Your CEO said: “The dataset describes conditions of various used cars and their current prices. I would like to learn what drives prices of used cars.
Look at the dataset and find the main factors that affect the value of a car – and then explain it to me.
Additionally, assess the impact of some special modifications (denoted by F1, F2, F3 and F4 in your dataset) on the price. This will help us understand whether we should make the modifications before selling a car. I would like to see the report, describing your main findings, on my desk on Thursday, February 11, 2021 at 10 A.M.“
Hint: You are asked to find general trends in the data. Report whatever you think is most important. Your CEO doesn’t want to see a list that is 20 items long; she would like to learn about just a few general trends. For example, one general trend could be: “The price decreases with the age of the car. Holding all other factors constant, the price of a car decreases by \$570 with each year. However, these dynamics are not constant: the value of younger cars decreases faster than that of older cars. For example, the value of cars less than 5 years old decreases by nearly \$2,500 per year.” (This is just an example; your numbers might be different.) For your second task, you have to check both the impact and the statistical significance of the F1-F4 attributes for making price predictions.
### NOTE:
## Due to skewed distributions, we should clean the data by:
## fill missing vals (median imputation)
## scale the datasets (minmaxscaler)
## remove outliers
## and one-hot encoding for categorical vars (to prep for model)
## MISSING VALUES
# df.isnull().sum() ## check for missing vals
# ## fill in missing values with median imputation (the pipeline below does this);
# ## a quick in-place alternative:
# df['odometer'] = df['odometer'].fillna(df['odometer'].median())
# # df.isnull().sum() ## check for missing vals
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
cat_cols = df.select_dtypes(include=['object']).columns
## Note: this scaler is fit on the full data for exploration only;
## the modeling pipeline below refits its own scaler after outlier removal
min_max_scaler = MinMaxScaler()
nums_scaled = pd.DataFrame(min_max_scaler.fit_transform(df[num_cols]), columns=num_cols)
## Outlier removal: collect indices of rows outside mean ± 3 std dev in any numeric column
outlier_locs = []
for col in num_cols:
    curr = df[col]
    mean, std = np.mean(curr), np.std(curr)
    cut_off = std * 3
    lower, upper = mean - cut_off, mean + cut_off
    # print(col, lower, upper)
    for idx, val in curr.items():
        if val < lower or val > upper:
            outlier_locs.append(idx)
outliers = list(set(outlier_locs))
## df2 is just df without the outliers
df2 = df.drop(df.index[outliers])
nums_scaled2 = nums_scaled.drop(nums_scaled.index[outliers])
print(df.shape, df2.shape)
(9997, 14) (9603, 14)
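The per-element loop above can also be expressed as a single vectorized mask, which applies the same 3-sigma cut-off (with `ddof=0`, matching `np.std`) and is much faster. A sketch on a synthetic frame standing in for `df`; the `drop_3sigma_outliers` helper is illustrative, not from the notebook:

```python
import numpy as np
import pandas as pd

def drop_3sigma_outliers(frame, num_cols):
    """Drop rows where any numeric column lies more than 3 std devs from its mean."""
    z = (frame[num_cols] - frame[num_cols].mean()) / frame[num_cols].std(ddof=0)
    mask = (z.abs() <= 3).all(axis=1)
    return frame[mask]

# Demo: the single extreme row is removed, the other 23 rows survive
demo = pd.DataFrame({'x': [1.0, 2.0, 1.5, 1000.0] + [1.2] * 20})
print(drop_3sigma_outliers(demo, ['x']).shape)  # (23, 1)
```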
## plot data
df2.hist(bins=50, figsize=(16,8))
plt.savefig('images/hist_1.png')
## View correlations btwn features
## focus on effect on PRICE
plt.figure(figsize=(10,6))
sns.heatmap(df2[num_cols].corr().abs(), annot=True, cmap="hot_r")  ## restrict to numeric cols (required in pandas >= 2.0)
plt.savefig('images/heatmap_1.png')
# scatter_matrix(df2[num_cols], figsize=(12,10))
# plt.savefig('images/scatter_matrix_num_cols.png')
## boxplots for numerical vars
fig, axes = plt.subplots(2, 3, sharex=False, figsize=(16,8))
fig.suptitle('Boxplots')
for i, var in enumerate(num_cols):
    row, col = divmod(i, 3)
    sns.boxplot(ax=axes[row, col], data=df2, x=var)
plt.savefig('images/boxplots_1.png')
## Categorical vars and PRICE
fig, axes = plt.subplots(2, 4, sharex=False, figsize=(16,8))
fig.suptitle('Categorical Vars and Price')
for i, var in enumerate(cat_cols):
    row, col = divmod(i, 4)
    sns.stripplot(x=var, y='price', data=df2, ax=axes[row, col])
fig.tight_layout()
plt.savefig('images/catplots_and_price_1.png')
## TODO: write about trends btwn price and various features (how they correlate as they go up or down) in the report
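One way to quantify the age trend the CEO's hint describes is the least-squares slope of price on car age. A minimal sketch on synthetic data; the `yearly_depreciation` helper and the `demo` frame are illustrative, and on the real data it would run on `df2`:

```python
import numpy as np
import pandas as pd

def yearly_depreciation(frame):
    """Slope of a least-squares fit of price on car age: price change per extra year of age."""
    age = frame['year'].max() - frame['year']
    slope, intercept = np.polyfit(age, frame['price'], 1)
    return slope

# Synthetic demo: price built to drop ~500 per year of age, plus noise
rng = np.random.default_rng(0)
years = rng.integers(2000, 2021, 300)
demo = pd.DataFrame({'year': years,
                     'price': 15000 - 500 * (2021 - years) + rng.normal(0, 300, 300)})
print(round(yearly_depreciation(demo)))  # close to -500
```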
## Part 2:
## TODO: Do hyp testing for stat sig of F1 - F4 attributes
Your Technical Manager said: “I would like you to propose a predictive model that can be used to determine the price of a used car. The problem is that state law demands that this model be easily interpretable, which means we are restricted to simple methods like Linear Regression, Ridge Regression, LASSO and Elastic Net. Additionally, we need to know how accurate the model is. You must choose the best model and report its root mean square error. Describe everything in your report and I will study it carefully.” Hint: In the most typical approach, you build three datasets: a training set, a validation set and a test set. You use the validation set to determine the best model, and the test set to estimate model accuracy. In your report, describe how you trained the models, how you selected the best one and how you tested its performance at the end.
## TRANSFORM STEP (After outlier removal)
## Transforms the dataset to prep for model
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('minmaxscaler', MinMaxScaler())
])
full_pipeline = ColumnTransformer([
    ("numerical", num_pipeline, num_cols),
    ("categorical", OneHotEncoder(), cat_cols)
])
arr_cleaned = full_pipeline.fit_transform(df2)
print(arr_cleaned.shape)  ## after outlier removal
(9603, 30)
pd.DataFrame(arr_cleaned).head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.423868 | 0.864865 | 0.180614 | 0.016909 | 0.783441 | 0.432530 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.080712 | 0.648649 | 0.451015 | 0.051861 | 0.625397 | 0.411489 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.148367 | 0.486486 | 0.324069 | 0.067003 | 0.267310 | 0.597377 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.304295 | 0.864865 | 0.111978 | 0.432934 | 0.664625 | 0.688662 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.202889 | 0.567568 | 0.464119 | 0.152429 | 0.435985 | 0.978624 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
5 rows × 30 columns
plt.hist(arr_cleaned[:,0])
(array([2151., 2707., 1564., 1056., 638., 490., 446., 326., 160.,
65.]),
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
<BarContainer object of 10 artists>)
## split into train/test (the validation role is filled by k-fold CV during model selection)
X = arr_cleaned[:,1:]
y = arr_cleaned[:,0] ## price variable
print(X.shape, y.shape)
(9603, 29) (9603,)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2021)
print(X_train.shape, X_test.shape)
(7682, 29) (1921, 29)
## TODO:
## spot check models
## try diff models compare results
## evaluate models using diff metrics
## k fold CV
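The TODO list above could be sketched as 5-fold cross-validated RMSE for each of the four permitted model families. The synthetic `make_regression` data stands in for the real `(X_train, y_train)`, and the alpha values are illustrative, not tuned:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

# Synthetic stand-in for (X_train, y_train); swap in the real split from above
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0,
                                 random_state=2021)

models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'lasso': Lasso(alpha=0.01),
    'elastic_net': ElasticNet(alpha=0.01, l1_ratio=0.5),
}
rmse_by_model = {}
for name, model in models.items():
    # 5-fold CV; scikit-learn maximizes scores, so MSE comes back negated
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring='neg_mean_squared_error')
    rmse_by_model[name] = np.sqrt(-scores).mean()
best = min(rmse_by_model, key=rmse_by_model.get)
print(best, round(rmse_by_model[best], 2))
```

On the real split, the model with the lowest cross-validated RMSE would be refit on all of `X_train` and scored once on `X_test`.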
The Senior Developer took you aside and said: “My task is to deploy your model to production, but I cannot deploy a paper report. I need your code. However, remember that I am not a Data Scientist like you; I have a different expertise. I will read your code, but you should make sure that I can follow and understand it, and that I know how to use it.”
Hint: In the ideal case, people should be able to take your code, run it and recreate all your results. In a less ideal case, it should be a demonstration of a typical run. The code should demonstrate your approach end-to-end: people should just specify the path to the dataset, run it and see the final results. Another name for this is a technical demo. In your future work, you will often be asked to demo your results. People will expect an end-to-end example where you read the raw data, train your model and evaluate the results of the predictions.
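The hint above could be sketched as a single entry point that takes the dataset path and reports test RMSE. This is a hedged demo skeleton, not the notebook's exact pipeline: it keeps `price` unscaled as the target, uses Ridge as a placeholder model, and the name `run_demo` is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def run_demo(csv_path):
    """End-to-end demo: load raw data, clean, train, and report test RMSE."""
    df = pd.read_csv(csv_path)
    # Target stays unscaled so the RMSE is in dollars
    num_cols = df.select_dtypes(include=['float64', 'int64']).columns.drop('price')
    cat_cols = df.select_dtypes(include=['object']).columns
    pipeline = ColumnTransformer([
        ('numerical', Pipeline([('imputer', SimpleImputer(strategy='median')),
                                ('minmaxscaler', MinMaxScaler())]), num_cols),
        ('categorical', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ])
    X = pipeline.fit_transform(df)
    y = df['price'].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=2021)
    model = Ridge(alpha=1.0).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f'Test RMSE: {rmse:.2f}')
    return rmse

# Usage: run_demo('data/used_car_dataset.csv')
```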